Abstract: The normalized radial basis function neural network
emerges in the statistical modeling of natural laws that relate
components of multivariate data. The modeling is based on the
kernel estimator of the joint probability density function
pertaining to given data. From this function a governing law is
extracted by the conditional average estimator. The corresponding
nonparametric regression represents a normalized radial basis
function neural network and can be related to the multi-layer
perceptron equation. In this article an exact equivalence of both
paradigms is demonstrated for a one-dimensional case with
symmetric triangular basis functions. The transformation provides
for a simple interpretation of perceptron parameters in terms of
statistical samples of multivariate data.

Index Terms: kernel estimator, conditional average, normalized
radial basis function neural network, perceptron



I. Introduction 

Multi-layer perceptrons (MLPs) have played a central role in the
research of neural networks [1], [2]. Their study began with the
nonlinear and adaptive response characteristics of neurons, which
brought with them many difficulties in understanding the
collective properties of MLPs. Consequently, it was discovered
rather late that the MLP is a universal approximator of relations
between input signals [1], [2], [3]. However, supervised training
of MLPs by back-propagation of errors is relatively time-consuming
and does not provide a simple interpretation of MLP parameters.
The inclusion of a priori information into an MLP is also
problematic. Many of these problems do not appear in simulations
of radial basis function neural networks (RBFN) [4]. The structure
of the normalized RBFN stems from the representation of the
empirical probability density function of sensory signals in terms
of prototype data and can be given a simple statistical
interpretation [5]. An optimal description of relations is
provided in this case by the conditional average estimator (CA),
which represents a general, non-linear regression and corresponds
to a normalized RBFN. A priori information can also be included in
this model by initialization of prototypes. A learning rule
derived from the maximum entropy principle describes a
self-organized adaptation of neural receptive fields [5], [6],
[7]. The separation of input signals into independent and
dependent variables need not be done before training, as with
MLPs, but can be performed when applying a trained network.
Because of these convenient properties of RBFNs, our aim was to
compare both NN paradigms and to explore whether the RBFN is
equivalent to the MLP with respect to the modeling of mapping
relations. Here we demonstrate their exact equivalence for a
simple one-dimensional case by showing that the mapping relation
of an RBFN can be directly transformed into that of an MLP, and
vice versa. This further indicates how MLP parameters can also be
statistically interpreted in the case of multivariate data.

II. Estimation of probability density functions 

The task of both paradigms is the modeling of relations between
components of measured data. We assume that $D$ sensors provide
signals $(s_1, s_2, \ldots, s_D)$ that comprise a vector
$\mathbf{x}$. The modeling is here based on an estimation of the
joint probability density function (PDF) of vector $\mathbf{x}$.
We assume that information about the probability distribution is
obtained by a repetition of measurements that yield $N$
independent samples
$\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$. The PDF is
then described by the kernel estimator [6]:

$$ f_e(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} w(\mathbf{x} - \mathbf{x}_n, \sigma) \qquad (1) $$



Here the kernel $w(\mathbf{x} - \mathbf{x}_n, \sigma)$ is a smooth
approximation of the delta function, such as a radially symmetric
Gaussian function
$w(\mathbf{x}, \sigma) = \mathrm{const.}\, \exp(-\|\mathbf{x}\|^2 / 2\sigma^2)$.
The constant $\sigma$ can be objectively interpreted as the width
of a scattering function describing stochastic fluctuations in the
channels of a data acquisition system and can be determined by a
calibration procedure [5], [7], [8].
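
The kernel estimator of Eq. (1) can be illustrated by a minimal Python sketch. The function names `gaussian_kernel` and `kernel_estimator` are hypothetical, and the value chosen for the constant in the Gaussian kernel (normalization to unit integral in $D$ dimensions) is an assumption that the text leaves open.

```python
import numpy as np

def gaussian_kernel(x, sigma):
    """Radially symmetric Gaussian w(x, sigma).

    Assumption: 'const.' is chosen so that w integrates to one in D dimensions.
    x has shape (D,) or (N, D); the kernel is evaluated row by row.
    """
    x = np.atleast_2d(x)
    d = x.shape[1]
    norm = (2.0 * np.pi * sigma**2) ** (-d / 2.0)
    return norm * np.exp(-np.sum(x**2, axis=1) / (2.0 * sigma**2))

def kernel_estimator(x, samples, sigma):
    """Eq. (1): f_e(x) = (1/N) * sum_n w(x - x_n, sigma).

    x: point of shape (D,); samples: array of shape (N, D).
    """
    samples = np.atleast_2d(samples)
    return gaussian_kernel(x - samples, sigma).mean()
```

The samples are expected as an array of shape (N, D), so a one-dimensional data set is passed as a column, for example `np.array([[-1.0], [0.5], [2.0]])`.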

However, in an application the complete PDF need not be stored; it
is sufficient to preserve a set of statistical samples
$\mathbf{x}_n$. In order to obtain a smooth estimator of the PDF,
the neighboring sample points should be separated in the sample
space by approximately $\sigma$. From this condition one can
estimate a proper number $N$ of samples [5], [7], [8]. In a
continuous measurement the number of samples increases without
limit, and there arises a problem with the finite capacity of the
memory in which the data are stored. Neural networks are composed
of finite numbers of memory cells, and therefore we must assume
that the PDF can be represented by a finite number $K$ of
prototype vectors
$\{\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_K\}$ as

$$ f_r(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} w(\mathbf{x} - \mathbf{q}_k, \sigma) \qquad (2) $$



In the modeling of $f_r$ the prototypes are first initialized by
$K$ samples: $\{\mathbf{q}_k = \mathbf{x}_k,\ k = 1 \ldots K\}$,
which represent a priori given information. These prototypes can
be adapted to additional samples $\mathbf{x}_N$ in such a way that
the mean-square difference between $f_e$ and $f_r$ is minimized.
The corresponding rule was derived elsewhere, and it describes the
self-organized unsupervised learning of neurons, each of which
contains one prototype $\mathbf{q}_k$ [6], [7].

The estimator of the PDF given in Eq. (2) can be simply
generalized by assuming that various prototypes are associated
with different probabilities and receptive fields [7]:
$1/K \mapsto p_k$ and $\sigma \mapsto \sigma_k$. This substitution
yields a generalized model:

$$ f_g(\mathbf{x}) = \sum_{k=1}^{K} p_k\, w(\mathbf{x} - \mathbf{q}_k, \sigma_k) \qquad (3) $$

However, in this case several advantages of a simple
interpretation of the model Eq. (2) are lost, which causes
problems when analyzing its relation to the perceptron model.
Therefore, we further consider the simpler model given in Eq. (2).

Fig. 1. An example of a linear interpolating function through sample points.



III. Conditional average 

In the application of an adapted PDF the information must be
extracted from prototypes, which generally corresponds to some
kind of statistical estimation. In a typical application there is
some partial information given, for instance the first $i$
components of the vector:
$\mathbf{g} = (s_1, s_2, \ldots, s_i, \emptyset)$, while the
hidden data, which have to be estimated, are then represented by
the vector $\mathbf{h} = (\emptyset, s_{i+1}, \ldots, s_D)$ [5],
[8]. Here $\emptyset$ denotes the missing part in a truncated
vector. As an optimal estimator we apply the conditional average,
which can be expressed by prototype vectors as [4], [5], [8]:

$$ \mathbf{h}(\mathbf{g}) = \sum_{k=1}^{K} \mathbf{h}_k B_k(\mathbf{g}) \qquad (4) $$

where

$$ B_k(\mathbf{g}) = \frac{w(\mathbf{g} - \mathbf{g}_k, \sigma)}{\sum_{j=1}^{K} w(\mathbf{g} - \mathbf{g}_j, \sigma)} \qquad (5) $$



Here the given vector $\mathbf{g}$ plays the role of the given
condition. The basis functions $B_k(\mathbf{g})$ are strongly
nonlinear and peaked at the truncated vectors $\mathbf{g}_k$. They
represent the measure of similarity between the given vector
$\mathbf{g}$ and the prototypes $\mathbf{g}_k$.

The CA represents a general non-linear, non-parametric regression,
which has already been successfully applied in a variety of fields
[5], [7], [8]. It is important that the separation into given and
hidden data can be done after training the network, which
essentially contributes to the adaptability of the method to
various tasks in an application [5].

The CA corresponds to a mapping relation
$\mathbf{g} \to \mathbf{h}$ that can be realized by a two-layer
RBFN [4]. The first layer consists of $K$ neurons. The $k$-th
neuron obtains the input signal $\mathbf{g}$ over synapses
described by $\mathbf{g}_k$ and is excited as described by the
radial basis function $B_k(\mathbf{g})$. The corresponding
excitation signal is then transferred to the neurons of the second
layer. The $i$-th neuron of this layer has synaptic weights
$h_{k,i}$ and generates the output $h_i(\mathbf{g})$.
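
The two-layer structure of Eqs. (4) and (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `conditional_average` and the array layout (prototypes stored row-wise and already split into given parts `G` and hidden parts `H`) are hypothetical, and the Gaussian kernel is used as the basis function.

```python
import numpy as np

def conditional_average(g, G, H, sigma):
    """Eqs. (4)-(5): h(g) = sum_k h_k B_k(g) with Gaussian basis functions.

    g: given vector, shape (d_g,)
    G: given parts g_k of the K prototypes, shape (K, d_g)
    H: hidden parts h_k of the K prototypes, shape (K, d_h)
    """
    sq = np.sum((G - g) ** 2, axis=1)      # first layer: ||g - g_k||^2
    # Shifting by the minimum avoids underflow and leaves B_k unchanged,
    # because the common factor cancels in the normalization.
    w = np.exp(-(sq - sq.min()) / (2.0 * sigma**2))
    B = w / w.sum()                        # Eq. (5): normalized excitations
    return B @ H                           # second layer, Eq. (4)
```

The normalization constant of the kernel cancels in Eq. (5), so it does not appear in the sketch. The split into `G` and `H` is made only at evaluation time, which reflects the remark that given and hidden data can be chosen after training.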

IV. Transition from RBFN to MLP 

In order to obtain a relation with an MLP it is instructive to
analyze the performance of the RBFN in a simple two-dimensional
case, for example as shown in Fig. 1. We consider the function
$y(x)$ described by a set of sample pairs
$\{x_1, y_1; \ldots; x_i, y_i; \ldots; x_N, y_N\}$ with constant
spacing between the sample points: $\Delta x = x_{j+1} - x_j$ for
$j = 1 \ldots N-1$. We further introduce a triangular and a
piecewise linear sigmoidal basis function, as shown in Fig. 2:

$$ B_i(x) = \begin{cases} 1 - \dfrac{|x - x_i|}{\Delta x} & \text{for } x_{i-1} < x < x_{i+1} \\ 0 & \text{elsewhere} \end{cases} \qquad (6) $$

$$ S_i(x) = \begin{cases} 0 & \text{for } x < x_i \\ (x - x_i)/\Delta x & \text{for } x_i \le x \le x_{i+1} \\ 1 & \text{for } x > x_{i+1} \end{cases} \qquad (7) $$

Fig. 2. Examples of triangular and piecewise linear sigmoidal basis functions.



Using these, we can represent the function $y(x)$ by a linear
interpolating function comprising straight line segments
connecting the sample points. The CA can in this case be readily
transformed into an MLP expression by utilizing the relations:

$$ B_{i+1}(x) = S_i(x) - S_{i+1}(x) \qquad (8) $$

$$ S_i(x) = \frac{B_{i+1}(x)}{B_i(x) + B_{i+1}(x)} \qquad (9) $$



The result is:

$$ \begin{aligned} y(x) &= \frac{y_1 B_1(x) + \ldots + y_N B_N(x)}{B_1(x) + \ldots + B_N(x)} \\ &= \frac{y_1 B_1(x)}{B_1(x) + \ldots + B_N(x)} + \ldots + \frac{y_N B_N(x)}{B_1(x) + \ldots + B_N(x)} \\ &= \frac{y_1 B_1(x)}{B_1(x) + B_2(x)} + y_2 B_2(x) + \ldots + y_{N-1} B_{N-1}(x) + \frac{y_N B_N(x)}{B_{N-1}(x) + B_N(x)} \end{aligned} \qquad (10) $$

In the denominator of the first and last terms of this expression,
only those basis functions are kept that differ from zero in the
region where the basis function in the numerator also differs from
zero. The denominator in the terms of index 2 to $N-1$ is 1
because of the overlapping of neighboring basis functions. We
insert the relations of Eqs. (8) and (9) into Eq. (10) and obtain

$$ y(x) = y_1 + \sum_{i=1}^{N-1} (y_{i+1} - y_i)\, S_i(x) \qquad (11) $$



By introducing the parameters: $\Delta y_i = y_{i+1} - y_i$,
$c_i = 1/(x_{i+1} - x_i)$, $\theta_i = x_i/(x_{i+1} - x_i)$ and a
unique, normalized sigmoidal basis function:

$$ S(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } 0 \le x \le 1 \\ 1 & \text{for } x > 1 \end{cases} \qquad (12) $$

we can write Eq. (11) in the form of a two-layer perceptron
mapping relation

$$ y(x) = y_1 + \sum_{i=1}^{N-1} \Delta y_i\, S(c_i x - \theta_i) \qquad (13) $$

The first layer corresponds to neurons with synaptic weights
$c_i$ and threshold values $\theta_i$, while the second layer
contains a linear neuron with synaptic weights $\Delta y_i$ and
threshold $y_1$.
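
The equivalence of the normalized triangular RBFN and the perceptron relation Eq. (13) can be checked numerically. The following sketch is an illustration only: the function names `triangular_ca` and `perceptron` are hypothetical, equally spaced sample points are assumed, and the comparison is meant for $x$ inside the interval $[x_1, x_N]$, where the sum of the basis functions does not vanish.

```python
import numpy as np

def triangular_ca(x, xs, ys):
    """Conditional average with the triangular basis functions of Eq. (6)."""
    dx = xs[1] - xs[0]                                   # constant spacing assumed
    B = np.clip(1.0 - np.abs(x - xs) / dx, 0.0, None)    # B_i(x) for all i
    return np.dot(ys, B) / B.sum()                       # normalized RBFN output

def perceptron(x, xs, ys):
    """Two-layer perceptron of Eq. (13) with the ramp function S of Eq. (12)."""
    c = 1.0 / np.diff(xs)                  # c_i = 1 / (x_{i+1} - x_i)
    theta = xs[:-1] * c                    # theta_i = x_i / (x_{i+1} - x_i)
    S = np.clip(c * x - theta, 0.0, 1.0)   # first layer: S(c_i x - theta_i)
    return ys[0] + np.dot(np.diff(ys), S)  # second layer: y_1 + sum_i dy_i S
```

For sample points such as `xs = np.linspace(0.0, 4.0, 5)` with arbitrary values `ys`, both functions should agree with the piecewise linear interpolation of the samples, for instance as returned by `np.interp(x, xs, ys)`.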

The above derivation demonstrates that for the two-dimensional
distribution the mapping $x \to y$ determined by the conditional
average is identical with the mapping relation of a multi-layer
perceptron. However, a difference appears when the operations
needed for the mapping are executed. The operators involved in
both cases are described by different basis functions, which
correspond to different neurons in the implementation. If the
prototypes are not evenly spaced, then the last equation can still
be applied, although the transition regions will be of different
spans. However, in this case the basis functions $B_i(x)$ are no
longer symmetric. In applications it is more convenient to use a
Gaussian basis function rather than a triangular one, and in the
perceptron expression this yields the function $\tanh(\ldots)$. In
this case, the estimated function $y(x)$ generally does not run
through the sample points but rather approximates them by a
function having a smoother derivative than the piecewise linear
function. The correspondence between RBFN and MLP is then not
exact but approximate.

An additional interpretation is needed when the data are not
related by a regular function $y(x)$ but randomly, as described by
a joint probability density function $f(x, y)$. In this case,
various values of $y$ can be observed at a given $x$. Evaluation
of the CA in this case is not problematic, while in the perceptron
relation Eq. (13) the value $y_i$ must be substituted by the
conditional average of variable $y$ at $x_i$.

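The analysis of the correspondence between RBFN and MLP can be
extended to multi-variate mappings. Let us first consider the
situation with just two prototypes $\mathbf{q}_i$ and
$\mathbf{q}_j$ and Gaussian basis functions. The CA is then
described by the function

$$ \mathbf{h}(\mathbf{g}) = \frac{\mathbf{h}_i \exp\!\left(-\frac{\|\mathbf{g} - \mathbf{g}_i\|^2}{2\sigma^2}\right) + \mathbf{h}_j \exp\!\left(-\frac{\|\mathbf{g} - \mathbf{g}_j\|^2}{2\sigma^2}\right)}{\exp\!\left(-\frac{\|\mathbf{g} - \mathbf{g}_i\|^2}{2\sigma^2}\right) + \exp\!\left(-\frac{\|\mathbf{g} - \mathbf{g}_j\|^2}{2\sigma^2}\right)} \qquad (14) $$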



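We introduce the notation:
$\mathbf{g}_i = \bar{\mathbf{g}} - \Delta\mathbf{g}$,
$\mathbf{g}_j = \bar{\mathbf{g}} + \Delta\mathbf{g}$,
$\mathbf{h}_i = \bar{\mathbf{h}} - \Delta\mathbf{h}$,
$\mathbf{h}_j = \bar{\mathbf{h}} + \Delta\mathbf{h}$, in which the
overline denotes the average value and $2\Delta\mathbf{g}$ is the
spacing of the prototypes. If we express the norm by a scalar
product and cancel the term
$\exp[-(\|\mathbf{g} - \bar{\mathbf{g}}\|^2 + \|\Delta\mathbf{g}\|^2)/2\sigma^2]$
in the numerator and denominator, we obtain the expression:

$$ \mathbf{h}(\mathbf{g}) = \bar{\mathbf{h}} + \Delta\mathbf{h} \tanh[\Delta\mathbf{g} \cdot (\mathbf{g} - \bar{\mathbf{g}})/\sigma^2] \qquad (15) $$

in which $\cdot$ denotes the scalar product. In order to obtain
the relation between RBFN and MLP we introduce a weight vector
$\mathbf{c} = \Delta\mathbf{g}/\sigma^2$ and a threshold value
$\theta = \bar{\mathbf{g}} \cdot \Delta\mathbf{g}/\sigma^2$ into
Eq. (15) and obtain:

$$ \mathbf{h}(\mathbf{g}) = \bar{\mathbf{h}} + \Delta\mathbf{h} \tanh(\mathbf{c} \cdot \mathbf{g} - \theta) \qquad (16) $$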



This expression again describes a two-layer perceptron: the first
layer is composed of one neuron having the synaptic weights
described by the vector $\mathbf{c}$ and the threshold value
$\theta$. The second layer is composed of linear neurons having
synaptic weights $\Delta h_i$ and threshold values $\bar{h}_i$.
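
The cancellation that leads from Eq. (14) to Eq. (15) can be written out step by step. The abbreviation $u$ is introduced here only for illustration and is not part of the original notation.

```latex
\begin{align*}
\|\mathbf{g}-\mathbf{g}_i\|^2
  &= \|\mathbf{g}-\bar{\mathbf{g}}\|^2
   + 2\,\Delta\mathbf{g}\cdot(\mathbf{g}-\bar{\mathbf{g}})
   + \|\Delta\mathbf{g}\|^2, \\
\|\mathbf{g}-\mathbf{g}_j\|^2
  &= \|\mathbf{g}-\bar{\mathbf{g}}\|^2
   - 2\,\Delta\mathbf{g}\cdot(\mathbf{g}-\bar{\mathbf{g}})
   + \|\Delta\mathbf{g}\|^2, \\
u &= \Delta\mathbf{g}\cdot(\mathbf{g}-\bar{\mathbf{g}})/\sigma^2, \\
\mathbf{h}(\mathbf{g})
  &= \frac{(\bar{\mathbf{h}}-\Delta\mathbf{h})\,e^{-u}
         + (\bar{\mathbf{h}}+\Delta\mathbf{h})\,e^{u}}
          {e^{-u}+e^{u}}
   = \bar{\mathbf{h}} + \Delta\mathbf{h}\,\frac{e^{u}-e^{-u}}{e^{u}+e^{-u}}
   = \bar{\mathbf{h}} + \Delta\mathbf{h}\tanh u .
\end{align*}
```

The common factor $\exp[-(\|\mathbf{g}-\bar{\mathbf{g}}\|^2 + \|\Delta\mathbf{g}\|^2)/2\sigma^2]$ appears in every term of the numerator and denominator and therefore drops out.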

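The first-order approximation of the mapping expression Eq. (16)
is:

$$ \mathbf{h}(\mathbf{g}) = \bar{\mathbf{h}} + \Delta\mathbf{h}\, \Delta\mathbf{g} \cdot (\mathbf{g} - \bar{\mathbf{g}})/\sigma^2 \qquad (17) $$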



This equation represents a linear regression of $\mathbf{h}$ on
$\mathbf{g}$ that runs through both prototype points if we assign
$\sigma^2 = \|\Delta\mathbf{g}\|^2$. Its slope is determined by
the covariance matrix
$\Sigma = \Delta\mathbf{h}\, \Delta\mathbf{g}^T$. However, the
nonlinear regression specified in Eq. (15) follows a linear
regression only in the vicinity of the point determined by
$\bar{\mathbf{g}}$ and $\bar{\mathbf{h}}$, while it exhibits
saturation when $\mathbf{g}$ runs from $\bar{\mathbf{g}}$ over the
given prototypes to infinity. The saturation is a consequence of
the function $\tanh(\ldots)$, which is basic in the modeling of a
multi-layered perceptron.
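
The statements about the two-prototype case can be verified with a short numerical sketch. The function names below are hypothetical; the code only evaluates Eqs. (14), (15) and (17) as reconstructed from the text, with the averages and half-differences computed from the two prototypes.

```python
import numpy as np

def ca_two_prototypes(g, g_i, g_j, h_i, h_j, sigma):
    """Eq. (14): conditional average over two Gaussian basis functions."""
    a = -np.sum((g - g_i) ** 2) / (2.0 * sigma**2)
    b = -np.sum((g - g_j) ** 2) / (2.0 * sigma**2)
    m = max(a, b)                        # shift exponents for numerical stability
    wi, wj = np.exp(a - m), np.exp(b - m)
    return (h_i * wi + h_j * wj) / (wi + wj)

def _means_and_deltas(g_i, g_j, h_i, h_j):
    """Averages and half-differences: g_i = g_bar - dg, g_j = g_bar + dg."""
    return (0.5 * (g_i + g_j), 0.5 * (g_j - g_i),
            0.5 * (h_i + h_j), 0.5 * (h_j - h_i))

def tanh_form(g, g_i, g_j, h_i, h_j, sigma):
    """Eq. (15): h_bar + dh * tanh(dg . (g - g_bar) / sigma^2)."""
    g_bar, dg, h_bar, dh = _means_and_deltas(g_i, g_j, h_i, h_j)
    return h_bar + dh * np.tanh(np.dot(dg, g - g_bar) / sigma**2)

def linear_form(g, g_i, g_j, h_i, h_j, sigma):
    """Eq. (17): first-order approximation of the tanh form."""
    g_bar, dg, h_bar, dh = _means_and_deltas(g_i, g_j, h_i, h_j)
    return h_bar + dh * np.dot(dg, g - g_bar) / sigma**2
```

With these definitions, `ca_two_prototypes` and `tanh_form` should agree for any `g`, `linear_form` should pass through both prototypes when `sigma` equals the norm of the half-spacing, and `tanh_form` should saturate at `h_j` far beyond `g_j`.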

The reasoning presented above for a multi-variate case requires
additional explanation when transferred to a situation consisting
of many prototypes. Let us assume that $N$ prototypes with indexes
$1 \ldots N$ can be found in the hyper-sphere of radius
approximately $\sigma$ around the given datum $\mathbf{g}$, and
let these prototypes be spaced by approximately equal distances.
The CA can now be expressed with leading terms and remainders as
follows:

$$ \mathbf{h}(\mathbf{g}) = \frac{\sum_{i=1}^{N} \mathbf{h}_i \exp(-\|\mathbf{g} - \mathbf{g}_i\|^2/2\sigma^2) + O_h}{\sum_{i=1}^{N} \exp(-\|\mathbf{g} - \mathbf{g}_i\|^2/2\sigma^2) + O_w} \qquad (18) $$



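Here $O_h$ and $O_w$ represent two remainders, which are small in
comparison with the two leading terms. We again introduce the
average value, but now with respect to $N$ prototypes:
$\mathbf{g}_i = \bar{\mathbf{g}} + \Delta\mathbf{g}_i$,
$\mathbf{h}_i = \bar{\mathbf{h}} + \Delta\mathbf{h}_i$ for
$i = 1 \ldots N$. With this we obtain the approximate expression:

$$ \mathbf{h}(\mathbf{g}) \approx \bar{\mathbf{h}} + \frac{\sum_{i=1}^{N} \Delta\mathbf{h}_i \exp[\Delta\mathbf{g}_i \cdot (\mathbf{g} - \bar{\mathbf{g}})/\sigma^2]}{\sum_{i=1}^{N} \exp[\Delta\mathbf{g}_i \cdot (\mathbf{g} - \bar{\mathbf{g}})/\sigma^2]} \qquad (19) $$

For $\mathbf{g}$ in the vicinity of the average value, a linear
approximation of the exponential function is applicable, which
yields

$$ \mathbf{h}(\mathbf{g}) = \bar{\mathbf{h}} + \frac{1}{N} \sum_{i=1}^{N} \Delta\mathbf{h}_i\, \Delta\mathbf{g}_i \cdot (\mathbf{g} - \bar{\mathbf{g}})/\sigma^2 \qquad (20) $$

This expression represents a linear regression of $\mathbf{h}$ on
$\mathbf{g}$ specified by $N$ points. If we express the covariance
matrix

$$ \Sigma = \frac{1}{N} \sum_{i=1}^{N} \Delta\mathbf{h}_i\, \Delta\mathbf{g}_i^T \qquad (21) $$

by two principal vectors $\Delta\mathbf{h}_p$ and
$\Delta\mathbf{g}_p$:

$$ \Sigma = \Delta\mathbf{h}_p\, \Delta\mathbf{g}_p^T \qquad (22) $$

we obtain a simplified expression of the linear regression

$$ \mathbf{h}(\mathbf{g}) = \bar{\mathbf{h}} + \Delta\mathbf{h}_p\, \Delta\mathbf{g}_p \cdot (\mathbf{g} - \bar{\mathbf{g}})/\sigma^2 \qquad (23) $$

which is an approximation of an MLP mapping relation

$$ \mathbf{h}(\mathbf{g}) = \bar{\mathbf{h}} + \Delta\mathbf{h}_p \tanh[\Delta\mathbf{g}_p \cdot (\mathbf{g} - \bar{\mathbf{g}})/\sigma^2] \qquad (24) $$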

The parameters of a single neuron in the perceptron expression
thus correspond to the principal vectors of the covariance matrix
$\Sigma = \Delta\mathbf{h}_p\, \Delta\mathbf{g}_p^T$ determining a
local regression around the center of several neighboring
prototypes.
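
The local regression around several neighboring prototypes can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the method of the cited works: all function names are hypothetical, the prototypes are stored row-wise in `G` and `H`, and the "principal vectors" are read as the leading singular pair of the covariance matrix, which reproduces it exactly only when it has rank one (for instance for a scalar hidden variable).

```python
import numpy as np

def local_quantities(G, H):
    """Averages and deviations of N neighboring prototypes (rows of G and H)."""
    g_bar, h_bar = G.mean(axis=0), H.mean(axis=0)
    return g_bar, h_bar, G - g_bar, H - h_bar

def covariance(dG, dH):
    """Eq. (21): Sigma = (1/N) sum_i dh_i dg_i^T, shape (d_h, d_g)."""
    return dH.T @ dG / len(dG)

def linear_regression(g, g_bar, h_bar, Sigma, sigma):
    """Eq. (20): h(g) = h_bar + Sigma (g - g_bar) / sigma^2."""
    return h_bar + Sigma @ (g - g_bar) / sigma**2

def conditional_average(g, G, H, sigma):
    """Eqs. (4)-(5) with Gaussian basis functions, used for comparison."""
    sq = np.sum((G - g) ** 2, axis=1)
    w = np.exp(-(sq - sq.min()) / (2.0 * sigma**2))
    return (w / w.sum()) @ H

def principal_vectors(Sigma):
    """Leading singular pair of Sigma: one possible reading of Eq. (22).

    Assumption: the text does not specify how the principal vectors are found.
    """
    U, s, Vt = np.linalg.svd(Sigma, full_matrices=False)
    return U[:, 0] * s[0], Vt[0]
```

For prototypes placed at equal distance around their average, the conditional average and the linear regression should agree to second order in the displacement of `g` from the average, which is the content of the step from Eq. (19) to Eq. (20).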

The above expression shows that the transition from RBFN to MLP
can be performed quite generally. However, in the multi-variate
case, the decomposition of the CA into a perceptron mapping is not
as simple as in the one-dimensional case, because the
interpretation of perceptron parameters proceeds via a local
regression determined by the various prototypes surrounding the
given datum $\mathbf{g}$. In spite of this, our conjecture is that
both paradigms are equivalent with respect to the statistical
modeling of mapping relations, provided that both models include
the same number of adaptable parameters.



V. Conclusion

The conditional average that represents the regular function
$y(x)$ shown in Fig. 1 by a linear interpolating function can be
exactly decomposed into the multilayer perceptron relation. When
there are a small number of noise-corrupted sample data points
representing the function, the question of proper smoothing
arises. In the case of CA this is done by using symmetric radial
basis functions and increasing their width. The basis functions
centered at various points then overlap, which results in a
smoother $y(x)$. Because of multiple overlapping, the relations
between radial basis and sigmoidal functions become more
complicated, and the transition between the conditional average
and the perceptron relation becomes less obvious. However, when
the prototypes are obtained by self-organization, they represent a
statistical regularity, and the CA generally does not exhibit
statistical fluctuations. In this case, the proper RBF width
corresponds to the distance between closest neighbors, and
additional smoothing is not needed. The corresponding parameters
of the perceptron for one-dimensional mapping can then be simply
interpreted in terms of prototypes, as described by the model
equations Eq. (4) and Eq. (24). However, due to the complexity of
the self-organized formation of prototypes determining the RBFN
and the back-propagation learning of the MLP, it would be
difficult to find an exact mapping relation between both models,
especially in the multivariate case.



